Chapter 5 — A Cherry Picked

Again we face a problem. We know there was a person called Willem, and this person's name is the first word in the text we are OO modeling. As decent scholars creating an edition, we want to annotate that first word with what is known about this person. Ideally we would mine this information from many sources, but that requires time. For this proof of concept I will resort to mining the references to Willem made in the introduction to the text of Reynaert the Fox by André Bouwman and Bart Besamusca in their edition of the medieval 'animal epic'.

On cutting corners and cherry picking

The approach I take is (given that this is the first draft of this notebook) in places rather crude and corner-cutting. First of all, I am perusing only one possible source of information on Willem. I would like to take in the many editions there are (cf. Wackers 2006) and the Internet sources that have come into being, but all this must wait until later. As said, here I will take annotative information only from the latest edition, by Bouwman and Besamusca.

I will not be applying sophisticated NLP technologies (such as those offered by OpenRefine or NLTK). This is because in general these methods lack sufficient precision in the case of a single, not too long text such as we have here. Even machine learning approaches to the patterns in the (Dutch) body of literature surrounding Of Reynaert the Fox are unlikely to reveal much useful information, because even if this body is sizable to humanities researchers, it is not really the big data needed for machine learning technologies to jump beyond a mere 77% (or lower) precision and recall.

Another reason not to apply NLP and AI methods right away is that this series of notebooks is aimed at discovering the hermeneutic potential of notebooks and computer code, for which one would want to stay away as much as possible from reductive statistical techniques, at first at least.

Cutting corners in practice

For this chapter we will be using a tool called pdftotext to extract the text from the introduction of the edition of Reynaert the Fox by Bouwman et al. The aim of this series of notebooks is to model a text in an object-oriented way and to explore the limits of current reproducibility of scholarly activity related to scholarly editing. Therefore I want to shy away as much as possible from using serialized files. All too often these turn into sites of handcrafted annotative information and externalized knowledge.

Ideally these notebooks build an edition (or at least a proof of concept) that could be called a computational edition. The difference with a digital edition is that a computational edition can (and should) be executed, and only then results in an edition, constructed from available digital sources and without adding digital artefacts (such as transcriptions, annotations, comments, etc.) that are handcrafted by the scholarly editor. Serialization in whatever form invites these acts of externalizing knowledge, but not the act of creating knowledge. That is why I refrain as much as possible from using files even as intermediate media.

In the next class (module actually, but we can forgo detailing the distinction between these for now) I usually use a stored representation of the edition by Bouwman and Besamusca rather than hitting the OAPEN website each time to download the PDF of that edition. This is purely to avoid being blocked by the OAPEN server (which could happen if I were identified as an overuser of their resources).

Usually one would want to have a reference to the source of each annotation. As we mine annotations from only one place at the moment, however, I have chosen to hard-code this source reference. This is definitely a todo for a next iteration of improving the code.

A further form of cutting corners is found in the AnnotationGenerator.each_sentence() method. The status of this part of the code is, as that of any part for that matter, debatable. It tries to split the text into sentences based on explicit text pattern heuristics. NLP people would surely frown upon this, as there is no learning component in the software; it just sheepishly repeats the heuristics I thought worthwhile for determining sentence boundaries. The question here is: which is more adequate from a hermeneutic point of view? From an AI perspective I am probably cutting corners, and the machine is learning little. On the other hand, the regular expressions used and explicated may reveal more about how a scholar reads a text than any machine parsing algorithm at the moment is able to.
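To make this kind of heuristic concrete, here is a minimal, hypothetical sketch. It is not the code used below (which weighs several such signals into a score), and the abbreviation list is an assumption for illustration only: a period counts as a sentence ending when it is followed by whitespace and a capital letter, unless the preceding token is a known abbreviation.

```ruby
# Hypothetical, simplified sentence-boundary heuristic (illustration only).
# A period ends a sentence if followed by whitespace and a capital letter,
# unless the word before it is a known abbreviation.
ABBREVIATIONS = /\b(?:cf|ms|ed|vol)\.\z/i

def sentence_end?( before_period, after_period )
  return false if before_period.match?( ABBREVIATIONS )
  after_period.match?( /\A\s+\p{Lu}/ )
end

puts sentence_end?( "Thus the epic ends.", " The" )   # true
puts sentence_end?( "see cf.", " Bouwman" )           # false
```

The point of such explicit rules is exactly what the paragraph above argues: each regular expression documents a reading decision a scholar would otherwise make tacitly.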

A last corner cut to be mentioned is the fact that I have mimicked 'looking for information on' as a scholarly act by simply selecting those sentences that contain the name "Willem". A scholar clearly peruses the literature in more sophisticated ways than that. Again, for proving the concept of OO modeling as a scholarly computational literacy, this will probably serve. Moreover, these heuristics may in a next iteration be made far more sophisticated, and thus a better model or simulation of such acts.
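Reduced to a minimal sketch (with invented example sentences), that selection act amounts to no more than this:

```ruby
# 'Looking for information on Willem', in its crudest form:
# keep only the sentences that mention the name at all.
# The sentences below are made up for illustration.
sentences = [
  "Willem is named in the very first word of the text.",
  "The fox outwits the other animals.",
  "Little else is known about Willem."
]

mentions = sentences.select { |s| s.downcase.match?( /\bwillem\b/ ) }
puts mentions.length  # 2
```

This is essentially what AnnotationGenerator.get_annotations_for() does over the sentences mined from the PDF.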

The code in this notebook requires some software components to be installed, thus the comments in the cell below…


In [ ]:
## Requirements

# This notebook requires Poppler (a PDF rendering library based on the xpdf-3.0 code base)
# If you say this at the command line ($>): $> pdftotext -
# And it gets you an answer similar to this: 
#      pdftotext version 0.42.0
#      Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
#      Copyright 1996-2011 Glyph & Cog, LLC
# You are good to go
# If not you can install it with e.g. brew install poppler (Mac) or 
# check http://blog.alivate.com.au/poppler-windows/ for a Windows version
# see also https://poppler.freedesktop.org/

Next we find the code for mining annotations from the introduction of Bouwman, A. & Besamusca, B., 2009. By itself this module does not do too much, but it will turn out to be an essential component of the models from chapter 4 for representing knowledge. The method AnnotationGenerator.get_sentences() is responsible for downloading and converting the PDF of the edition. It then calls the method AnnotationGenerator.each_sentence(), which determines on the basis of several weighing factors which periods represent actual sentence endings. AnnotationGenerator.get_annotations_for( str ) simply filters that result for sentences containing the value of the variable 'str'. Lastly a small helper class Annotation is defined. This class doesn't do too much of anything, but it provides some syntactical clarity, so that instead of "annotation[0]" we can write "annotation.source". Computer scientists usually frown on such 'data classes' and 'syntactic sugar'. I think it identifies one of the many points where humanities researchers and computer scientists would differ in opinion on the hermeneutic qualities of code.


In [2]:
require 'open-uri'

module AnnotationGenerator

  def AnnotationGenerator.get_sentences
    if @sentences == nil
      @sentences = []
      text = ""

      # NOTE: this will require Poppler to be installed
      # Use the file variant if you don't want to hurt the OAPEN server too much
      # File.open( File.join( File.dirname(__FILE__), "../notebook/resources/340003.pdf"), "rb" ) do |url_source|
      # Use this open-uri::open variant for the real thing
      open( "http://www.oapen.org/download?type=document&docid=340003" ) do |url_source|
        IO.popen( "pdftotext - -", "r+" ) do |io|
          io.write url_source.read
          io.close_write
          text << io.read
        end
      end

      # We can do with setting a title for the source manually for now.
      @annotations_source = "Bouwman, A. & Besamusca, B., 2009. Of Reynaert the Fox: Text and Facing Translation of the Middle Dutch Beast Epic Van den vos Reynaerde, Amsterdam: Amsterdam University Press. Available at: http://www.oapen.org/search?identifier=340003 [Accessed November 20, 2015]."

      # Next thing I hold equivalent to scholarly 'only reading the introduction'.
      end_index = text.index( /39\s+Text, translation and note/ )

      # We don't need line breaks…
      text = text[0..end_index].gsub( "\n", " " )

      # Split text into sentences.
      each_sentence( text ) { |sentence| @sentences.push( sentence ) }

      # There are weird whitespace artefacts in the text at line beginnings;
      # let's remove those.
      @sentences.map!{ |sentence| sentence.gsub( /^[^\p{Word}]*/, "") }
    end
    @sentences.clone
  end

  # Split into sentences.
  # Crude! NLTK might perform better, but this serves to demonstrate the point.
  # Reduces the number of errors in splitting the text
  # into sentences by marking as many periods as possible
  # that relate to acronymic or other non-sentence-ending use.
  def AnnotationGenerator.each_sentence( text, report_analytics=false )
    sentence = ""
    while text.match( /(\S+)\.(\s+\S+\s+)/ ) != nil do
      last_match = Regexp.last_match
      sentence << last_match.pre_match
      score_parts = []
      # +2 control characters
      score_parts.push( last_match[0].match( /[\u0006\u0010\u0011\u0012\u0013\u0015\u000E]+/ ) != nil ? 2 : 0 )
      # +2 period is followed by a whitespace and Capital UNLESS ms. or cf.
      score_parts.push( ( (last_match[1].match( /\b[Mm]s|\b[Cc]f/ ) == nil) && (last_match[0].match( /\S+\.\s+\p{Lu}/ ) != nil) ) ? 2 : 0  )
      # +1 Most likely page reference
      score_parts.push( last_match[0].match( /\(\d+-\d+\)/ ) != nil ? 1 : 0 )
      # +1.5 Most likely sentence followed by footnote or page number
      score_parts.push( last_match[0].match( /\d+.\.\s+\d+/ ) != nil ? 1.5 : 0 )
      # -1/n chars: the shorter, the more likely it is an acronym
      score_parts.push( -1.0/last_match[0].match( /(\S+)\./ )[1].size )
      # -1 period is followed by a whitespace and lower case
      score_parts.push( last_match[0].match( /\S+\.\s+\p{Ll}/ ) != nil ? -1 : 0 )
      # -1 no vowels
      score_parts.push( ( last_match[0].match( /(\S+)\./ )[1].scan( /\w/ ) & "aeiou".scan( /\w/ ) ).size == 0 ? -1 : 0 )
      score = 0
      score_parts.each { |score_part| score+=score_part }
      sentence << last_match[1] << "."
      if report_analytics
        yield [sentence, score, score_parts, last_match[0] ]
      end
      if score > -0.2
        yield sentence if !report_analytics
        sentence = ""
      end
      text = last_match[2] << last_match.post_match
    end
  end

  def AnnotationGenerator.get_annotations_for( str )
    sentences = get_sentences
    sentences.reject! { |sentence| sentence.downcase.match( /#{str}\b/ ) == nil }
    sentences.map! { |sentence| Annotation.new( sentence, @annotations_source ) }
  end

end


class Annotation

  attr_accessor :text
  attr_accessor :source

  def initialize( text, source )
    @text = text
    @source = source
  end

end


Out[2]:
:initialize
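To see the 'syntactic sugar' point in action, here is a small usage sketch. The Annotation class is repeated so the snippet stands alone; the example text and the shorthand source string are invented:

```ruby
# Annotation as defined in the cell above, repeated for a standalone example.
class Annotation

  attr_accessor :text
  attr_accessor :source

  def initialize( text, source )
    @text = text
    @source = source
  end

end

annotation = Annotation.new(
  "A sentence mentioning Willem.",  # invented example text
  "Bouwman & Besamusca 2009"        # shorthand for the hard-coded source
)

# Named accessors instead of opaque indices such as annotation[0]:
puts annotation.text
puts annotation.source
```

The design choice is deliberate: a plain array would work just as well, but `annotation.source` documents the editorial intent in a way `annotation[1]` does not.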

Now we need to combine this AnnotationGenerator with the OO models we made in chapter 4. This is the topic of chapter 6.

References

Bouwman, A. & Besamusca, B., 2009. Of Reynaert the Fox: Text and Facing Translation of the Middle Dutch Beast Epic Van den vos Reynaerde, Amsterdam: Amsterdam University Press. Available at: http://www.oapen.org/search?identifier=340003 [Accessed November 20, 2015].

Wackers, P., 2006. Editing “Van den Vos Reynaerde.” Variants, 5, pp.260–276.


